The aim of this notebook is to provide an introduction to the many other resources there are online for learning about Python, working with data and much much more! Also, we'll cover how you can pull in any of this code and run it here, and how to deal with annoying dependencies.
Contents here, or just overview.
This course isn't designed to introduce you to programming in a serious way, but Jupyter notebooks are a great environment to learn programming in, and at least 51 different languages are supported. Which language you should learn depends on what you plan to do. For general scientific programming, and for beginners, Python is a good choice, thanks to its massive number of supported packages, readability and active community. However, R is very popular among statisticians, Fortran among physicists, and Haskell is used at Edinburgh to teach programming. A good comparison can be found [here][meta]. We hope to have kernels available for these in this notebook environment soon (~August 2015).
There are many free resources for learning different programming languages online, some of which are [notebooks][books]. A classic textbook for learning programming in general is the Art and Craft of Programming, also available online for free. There are also many courses available, either as free courseware from other universities or from Coursera (or similar): these are summarised [on this page under "Where to Learn"][meta].
So, as an example, say you've decided you want to follow this notebook on Learning Python. You could download it to your local computer, then navigate to the tree view and upload it from there, but that's a lot of hard work. Instead, you can just execute the following cell to pull that notebook into your home directory:
[meta]: https://www.metacademy.org/roadmaps/rgrosse/basic_programming
[books]: https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#programming-and-computer-science
In [20]:
%%bash
curl http://nbviewer.ipython.org/urls/bitbucket.org/amjoconn/watpy-learning-to-code-with-python/raw/3441274a54c7ff6ff3e37285aafcbbd8cb4774f0/notebook/Learn%20to%20Code%20with%20Python.ipynb > Learn-to-Code-with-Python.ipynb
Taking apart this command:

- `curl` fetches whatever is at the URL that follows it and writes it to standard output.
- The `>` operator redirects the output of the curl command into a file.
- `Learn-to-Code-with-Python.ipynb` is the name of that file.

So now, if you find a notebook file hosted anywhere on the internet you can download it with curl using the address and put it in whatever file you like.
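If you would rather stay in Python than drop into bash, the same download can be sketched with the standard library. This is only an illustration: `download_notebook` is a name made up here, not part of any package, and it assumes the URL serves the raw `.ipynb` file.

```python
import shutil
import urllib.request


def download_notebook(url, filename):
    """Save whatever `url` serves into `filename`, like `curl URL > filename`.

    `download_notebook` is a hypothetical helper written for this example.
    """
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        # Stream the response body straight into the output file.
        shutil.copyfileobj(response, out)
    return filename
```

Calling `download_notebook(address, "My-Notebook.ipynb")` with the address of a notebook file should then leave it next to this notebook, ready to open from the tree view.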
However, if you wanted to pull in an entire repository the command is different, because we're going to use the Git version control system. So, for example, say you want the ipython minibook code, then you need to clone the repository. To do that you'll need the clone url, which is highlighted in the following image:
We would like to clone with `https` in this case, so the corresponding url is `https://github.com/ipython-books/minibook-code.git`. Now, we just need to use the following `git` command:
In [23]:
%%bash
git clone https://github.com/ipython-books/minibook-code.git
This has created a directory containing everything in that repository, including the entire revision history:
In [24]:
cd minibook-code/
In [25]:
ls
The following is an example of the git log (extremely short as the code in this repository has just been migrated from another project):
In [36]:
%%bash
git log
In [37]:
cd ..
The following cell will pull in some repositories useful for learning Python and put them in a subdirectory `learning-python`:
In [42]:
%%bash
mkdir learning-python
cd learning-python
git clone https://github.com/ehmatthes/intro_programming.git
git clone https://github.com/yoavram/CS1001.py.git
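A cell like the one above fails if you run it twice, because the directories already exist. One possible workaround is a small Python wrapper around `git clone` that skips anything already present. The helper names `repo_name` and `clone_all` are invented for this sketch, and it assumes `git` is available on the path.

```python
import os
import subprocess


def repo_name(url):
    """Return the directory name git would pick by default for `url`."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


def clone_all(urls, parent):
    """Clone each repository into `parent`, skipping ones already there.

    Hypothetical helper: it simply shells out to `git clone`.
    """
    os.makedirs(parent, exist_ok=True)
    for url in urls:
        target = os.path.join(parent, repo_name(url))
        if os.path.exists(target):
            continue  # already cloned on a previous run
        subprocess.run(["git", "clone", url, target], check=True)
```

For example, `clone_all(["https://github.com/ehmatthes/intro_programming.git"], "learning-python")` can be run repeatedly without errors.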
Here are some other things that may be useful:
Python is a popular language for science and data science, and many of the most popular notebooks cover data science using Python. As a result, there are also some very good introductions to data science available as IPython notebooks.
The following cell will populate a directory `python-data-science` with some of the best resources for learning data science techniques in Python:
In [45]:
%%bash
mkdir python-data-science
cd python-data-science
git clone https://github.com/nborwankar/LearnDataScience.git Learn-Data-Science
curl https://raw.githubusercontent.com/mwaskom/Psych216/master/week6_tutorial.ipynb > A-Tutorial-on-Model-Reliability.ipynb
mkdir Holoviews-tutorials
git clone https://github.com/ioam/holoviews.git
mv holoviews/doc/Tutorials/* Holoviews-tutorials/
rm -rf holoviews
git clone https://bitbucket.org/hrojas/learn-pandas.git An-Introduction-To-Pandas
git clone https://github.com/amueller/tutorial_ml_gkbionics.git Simple-Machine-Learning-with-Scikit-Learn
git clone https://github.com/mwaskom/Psych216.git Statistics-and-Data-Analysis-in-Python
git clone https://github.com/ogrisel/parallel_ml_tutorial.git Parallel-Machine-Learning-with-Scikit-Learn
git clone https://github.com/ResearchComputing/Meetup-Fall-2013.git
mkdir Python-for-Data-Analysis
mv Meetup-Fall-2013/python/* Python-for-Data-Analysis/
rm -rf Meetup-Fall-2013
The following are recommendations by me, Gavin Gray, and not necessarily the opinion of the course, Edinburgh University or anyone else. I just wanted to provide some recommendations of useful resources and tools if you would like to approach some more advanced concepts in data science.
First, I have to recommend Probabilistic Programming and Bayesian Methods for Hackers. If you're not familiar with probabilistic programming, then the title will make no sense to you. Probabilistic programming is a way of separating the model building part of data science from what is called the inference part of data science, which is usually the algorithm that actually makes the prediction. Usually, you'd like your model to be a little bit more complicated, because the real world is a complicated place, but writing the algorithm you'd need to make predictions using this model would take a long time. With probabilistic programming you only write down the model, and a general-purpose inference algorithm does the rest. There are some drawbacks to this method, such as problems with scaling to large datasets, but these are discussed well in the book itself, along with what probabilistic programming is and how it works. Practically speaking, if this is applied well, it is a very good way of doing Bayesian statistics and I would say it's a very good way to attack the kinds of problems people would normally use statistics for. Finally, note that if your model is correct and you can do inference then you will make the best predictions possible (given the data); for an example of this, see Iain Murray's Dark Worlds blog post.
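To make the split between model and inference a little more concrete, here is a toy sketch in plain Python. It is not how PyMC works internally and uses none of its API: the coin-flip model, the function names and the grid approximation are all assumptions chosen to keep the example tiny. The point is only that `grid_posterior_mean` knows nothing about coins, so the model could be swapped without touching it.

```python
def coin_model(p, heads, tails):
    """Model: unnormalised posterior for a coin's bias p.

    A flat prior on [0, 1] multiplied by the likelihood of the observed flips.
    """
    prior = 1.0
    likelihood = p ** heads * (1 - p) ** tails
    return prior * likelihood


def grid_posterior_mean(model, data, steps=1000):
    """Inference: approximate the posterior mean of p on an evenly spaced grid.

    Works for any one-parameter `model` defined on [0, 1].
    """
    grid = [(i + 0.5) / steps for i in range(steps)]
    weights = [model(p, *data) for p in grid]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total


# Seven heads and three tails (made-up data for the example).
mean = grid_posterior_mean(coin_model, data=(7, 3))
print(round(mean, 3))
```

With a flat prior, the exact posterior here is a Beta(8, 4) distribution with mean 8/12, so the printed value should be close to 0.667. A grid like this stops being practical beyond a handful of parameters, which is where the sampling methods described in the book come in.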
If you encounter terms that you don't understand, or a concept you've heard of but haven't quite got the hang of yet, then probably the best place to go is Metacademy. The idea is that it is a "package manager for knowledge"; a package manager being something that keeps track of packages you've installed (typically on linux) and makes sure that you've already installed everything that a given package depends on to work. Applied to knowledge, that means you get a list of dependencies you can tick off before you reach the thing you want to understand, rather than reading the same page in a textbook over and over wondering why you don't understand a step in the reasoning. Also, they only link to other resources, rather than trying to reinvent the wheel. So, you get the best resource for a given topic, which is often a pdf of open access lecture notes that just happens to be written very well, or a pre-print textbook that you can download for free. And, they give multiple resources so you can try the second if you don't like the style in the first.
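The "package manager" idea can be sketched in a few lines of Python: given a map from each concept to its prerequisites, walk the dependencies first and list the target last. The concept names below are invented for illustration and are not taken from Metacademy, and the sketch assumes the prerequisites contain no cycles.

```python
def learning_order(target, prerequisites):
    """Return every concept needed for `target`, prerequisites first."""
    order = []
    seen = set()

    def visit(concept):
        if concept in seen:
            return
        seen.add(concept)
        # Learn everything this concept depends on before the concept itself.
        for dep in prerequisites.get(concept, []):
            visit(dep)
        order.append(concept)

    visit(target)
    return order


# Hypothetical dependency graph, purely for illustration.
prerequisites = {
    "MCMC": ["Bayesian inference", "Markov chains"],
    "Bayesian inference": ["probability", "conditional probability"],
    "conditional probability": ["probability"],
    "Markov chains": ["probability"],
}

for concept in learning_order("MCMC", prerequisites):
    print(concept)
```

Each concept is printed only after everything it depends on, which is the list you would tick off on the way to the target.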
I fully endorse this list of notebooks on various topics. In particular, the parallel machine learning tutorial (which is added if you ran the cell above for python-data-science) was very useful to me when I was working on a biological data science project where we had to speed up processing by parallelising. In addition, since Olivier Grisel is a scikit-learn dev, he shows how to use all of the scikit-learn tools quickly and efficiently, and highlights some potential pitfalls. Also, the notebook on d3.js may come in useful if you would like one day to make something like this.
If you would like to look at the code from research papers, one of the best resources available (and getting better rapidly) is GitXiv. The idea is to match papers with code that will replicate the results of the paper, thus leading to reproducible science. One nice thing about the way they are doing it is the first person to reproduce the results of the paper can simply link their code to the paper, so even if the original authors don't release their code, the code can still make it into the open. Many of these have IPython notebooks.
Finally (and this is very biased), there are some great notebooks on deep learning available, making it relatively easy to do some difficult things. For instance, if you saw the deep dream blog post by Google recently, did you know that you can grab a notebook with all of the code to make those images? More recently, there has been work redrawing images in the style of other images (usually famous paintings) and there are notebooks on how to do this. Playing with these is fun, but you will have problems running them on our server, or on your own machine. To run these, you really need a computer with a powerful GPU and the appropriate drivers set up. One easy way to do this would be using Amazon's EC2, which would involve setting up the remote server with an IPython notebook server and using an ssh tunnel to access it from your local machine. That's beyond our scope, but there are tutorials online that will cover it.
In [55]:
%%bash
mkdir recommendations
cd recommendations
git clone https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers.git
git clone https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python.git
git clone https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition.git
git clone https://github.com/jseabold/538model.git
curl https://dato.com/learn/gallery/notebooks/graph_analytics_movies.ipynb > Seven-Degrees-of-Kevin-Bacon.ipynb
git clone https://github.com/ellisonbg/talk-2014-strata-sc.git
curl http://norvig.com/ipython/xkcd1313.ipynb > Norvig-regex-golf.ipynb
In [1]:
import pymc
So what's happened is we've tried to import the package PyMC, but we haven't installed it yet so we get an `ImportError`. Normally, if you were on your own computer you could just run the following:
In [3]:
!pip3 install pymc
As you can see, this fails because we don't have sufficient privileges. On your own computer you could use sudo. We can get around this by installing to our user account with the `--user` flag:
In [4]:
!pip3 install --user pymc
Unfortunately, importing will still fail:
In [5]:
import pymc
When Python tries to import a package, it looks in a few places, and it turns out that when you install something with the `--user` flag it is installed to your user account at `~/.local/`:
In [11]:
!ls ~/.local/lib/python3.4/site-packages
And Python is only looking in:
In [12]:
import sys
sys.path
Out[12]:
However, if we add the above path to Python's path, we will be able to import the package:
In [18]:
sys.path.append("/home/gngdb/.local/lib/python3.4/site-packages/")
In [19]:
import pymc
Note that this will only persist in this notebook session after we've run the `sys.path.append` command. In a new notebook, we'll have to run the `sys.path.append` command again. But this will include any packages we've installed using `pip install --user`, so it is a fairly useful way to install extra packages. For some packages, this can also be done by just cloning the git repository and adding its directory to your path using `sys.path.append` again.
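The path appended above is hard-coded to one user and one Python version. A slightly more portable sketch asks Python itself where `--user` packages go, using `site.getusersitepackages()` from the standard library. Whether this matches the directory pip actually used on this server is an assumption worth checking against the `ls` output above.

```python
import site
import sys

# Directory where `pip install --user` normally puts packages
# for the interpreter running this notebook.
user_site = site.getusersitepackages()

# Only add it once, so re-running the cell doesn't grow sys.path.
if user_site not in sys.path:
    sys.path.append(user_site)

print(user_site)
```

Putting this in the first cell of a new notebook should make packages installed with `pip install --user` importable for the rest of that session.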